{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plotting very large datasets meaningfully, using `datashader`\n",
    "\n",
    "There are a variety of approaches for plotting large datasets, but most of them are very unsatisfactory. Here we first show some of the issues, then demonstrate how the `datashader` library helps make large datasets truly practical.  \n",
    "\n",
    "We'll use part of the well-studied [NYC Taxi trip database](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml), with the locations of all NYC taxi pickups and dropoffs from the month of January 2015.  Although we know what the data is, let's approach it as if we are doing data mining, and see what it takes to understand the dataset from scratch."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load NYC Taxi data \n",
    "\n",
    "(takes 10-20 seconds, since it's in the inefficient but widely supported CSV file format...)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "%time df = pd.read_csv('../data/nyc_taxi.csv',usecols= \\\n",
    "                       ['pickup_x', 'pickup_y', 'dropoff_x','dropoff_y', 'passenger_count','tpep_pickup_datetime'])\n",
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, this file contains about 12 million pickup and dropoff locations (in Web Mercator coordinates), with passenger counts.\n",
    "\n",
    "### Define a simple plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bokeh.models import BoxZoomTool\n",
    "from bokeh.plotting import figure, output_notebook, show\n",
    "\n",
    "output_notebook()\n",
    "\n",
    "NYC = x_range, y_range = ((-8242000,-8210000), (4965000,4990000))\n",
    "\n",
    "plot_width  = int(750)\n",
    "plot_height = int(plot_width//1.2)\n",
    "\n",
    "def base_plot(tools='pan,wheel_zoom,reset',plot_width=plot_width, plot_height=plot_height, **plot_args):\n",
    "    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height,\n",
    "        x_range=x_range, y_range=y_range, outline_line_color=None,\n",
    "        min_border=0, min_border_left=0, min_border_right=0,\n",
    "        min_border_top=0, min_border_bottom=0, **plot_args)\n",
    "    \n",
    "    p.axis.visible = False\n",
    "    p.xgrid.grid_line_color = None\n",
    "    p.ygrid.grid_line_color = None\n",
    "    \n",
    "    p.add_tools(BoxZoomTool(match_aspect=True))\n",
    "    \n",
    "    return p\n",
    "    \n",
    "options = dict(line_color=None, fill_color='blue', size=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1000-point scatterplot: undersampling\n",
    "\n",
    "Any plotting program should be able to handle a plot of 1000 datapoints.  Here the points are initially overplotting each other, but if you hit the Reset button (top right of plot) to zoom in a bit, nearly all of them should be clearly visible in the following Bokeh plot of a random 1000-point sample.  If you know what to look for, you can even see the outline of Manhattan Island and Central Park from the pattern of dots.  We've included geographic map data here to help get you situated, though for a genuine data mining task in an abstract data space you might not have any such landmarks.  In any case, because this plot is discarding 99.99% of the data, it reveals very little of what might be contained in the dataset, a problem called *undersampling*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "from bokeh.tile_providers import STAMEN_TERRAIN\n",
    "\n",
    "samples = df.sample(n=1000)\n",
    "p = base_plot()\n",
    "p.add_tile(STAMEN_TERRAIN)\n",
    "p.circle(x=samples['dropoff_x'], y=samples['dropoff_y'], **options)\n",
    "show(p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10,000-point scatterplot: overplotting\n",
    "\n",
    "We can of course plot more points to reduce the amount of undersampling.  However, even if we only try to plot 0.1% of the data, ignoring the other 99.9%, we will find major problems with *overplotting*, such that the true density of dropoffs in central Manhattan is impossible to see due to occlusion:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "samples = df.sample(n=10000)\n",
    "p = base_plot()\n",
    "\n",
    "p.circle(x=samples['dropoff_x'], y=samples['dropoff_y'], **options)\n",
    "show(p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Overplotting is reduced if you zoom in on a particular region (may need to click to enable the wheel-zoom tool in the upper right of the plot first, then use the scroll wheel).  However, then the problem switches to back to serious undersampling, as the too-sparsely sampled datapoints get revealed for zoomed-in regions, even though much more data is available.\n",
    "\n",
    "### 100,000-point scatterplot: saturation\n",
    "\n",
    "If you make the dot size smaller, you can reduce the overplotting that occurs when you try to combat undersampling.  Even so, with enough opaque data points, overplotting will be unavoidable in popular dropoff locations.  So you can then adjust the alpha (opacity) parameter of most plotting programs, so that multiple points need to overlap before full color saturation is achieved.  With enough data, such a plot can approximate the probability density function for dropoffs, showing where dropoffs were most common:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "%%time\n",
    "options = dict(line_color=None, fill_color='blue', size=1, alpha=0.1)\n",
    "samples = df.sample(n=100000)\n",
    "p = base_plot(webgl=True)\n",
    "p.circle(x=samples['dropoff_x'], y=samples['dropoff_y'], **options)\n",
    "show(p)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../assets/images/nyc_taxi_100k.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[*Here we've shown static output as a PNG rather than a live Bokeh plot, to reduce the file size for distributing full notebooks and because some browsers will have trouble with plots this large.  The above cell can be converted into code and executed to get the full interactive plot.*]\n",
    "\n",
    "However, it's very tricky to set the size and alpha parameters.  How do we know if certain regions are saturating, unable to show peaks in dropoff density? Here we've manually set the alpha to show a clear structure of streets and blocks, as one would intuitively expect to see, but the density of dropoffs still seems approximately the same on nearly all Manhattan streets (just wider in some locations), which is unlikely to be true. We can of course reduce the alpha value to reduce saturation further, but there's no way to tell when it's been set correctly, and it's already low enough that nothing other than Manhattan and La Guardia is showing up at all. Plus, this alpha value will only work even reasonably well at the one zoom level shown. Try zooming in (may need to enable the wheel zoom tool in the upper right) to see that at higher zooms, there is less overlap between dropoff locations, so that the points *all* start to become transparent due to lack of overlap. Yet without setting the size and alpha to a low value in the first place, the stucture is invisible when zoomed out, due to overplotting. Thus even though Bokeh provides rich support for interactively revealing structure by zooming, it is of limited utility for large data; either the data is invisible when zoomed in, or there's no large-scale structure when zoomed out, which is necessary to indicate where zooming would be informative.\n",
    "\n",
    "Moreover, we're still ignoring 99% of the data.  Many plotting programs will have trouble with plots even this large, but Bokeh can handle 100-200,000 points in most browsers. Here we've enabled Bokeh's WebGL support, which gives smoother zooming behavior, but the non-WebGL mode also works well. Still, for such large sizes the plots become slow due to the large HTML file sizes involved, because each of the data points are encoded as text in the web page, and for even larger samples the browser will fail to render the page at all.  \n",
    "\n",
    "\n",
    "### 10-million-point datashaded plots: auto-ranging, but limited dynamic range\n",
    "\n",
    "To let us work with truly large datasets without discarding most of the data, we can take an entirely different approach. Instead of using a Bokeh scatterplot, which encodes every point into JSON and stores it in the HTML file read by the browser, we can use the [datashader](https://github.com/bokeh/datashader) library to render the entire dataset into a pixel buffer in a separate Python process, and then provide a fixed-size image to the browser containing only the data currently visible. This approach decouples the data processing from the visualization.  The data processing is then limited only by the computational power available, while the visualization has much more stringent constraints determined by your display device (a web browser and your particular monitor, in this case). This approach works particularly well when your data is in a far-off server, but it is also useful whenever your dataset is larger than your display device can render easily.\n",
    "\n",
    "Because the number of points involved is no longer a limiting factor, you can now use the entire dataset (including the full 150 million trips that have been made public, if you download that data separately).  Most importantly, because datashader allows computation on the intermediate stages of plotting, you can easily define operations like auto-ranging (which is on by default), so that we can be sure there is no overplotting or saturation and no need to set parameters like alpha.\n",
    "\n",
    "The steps involved in datashading are (1) create a Canvas object with the shape of the eventual plot (i.e. having one storage bin for collecting points, per final pixel), (2) aggregating all points into that set of bins, incrementally counting them, and (3) mapping the resulting counts into a visible color from a specified range to make an image:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datashader as ds\n",
    "from datashader import transfer_functions as tf\n",
    "from datashader.colors import Greys9\n",
    "Greys9_r = list(reversed(Greys9))[:-2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, x_range=x_range, y_range=y_range)\n",
    "agg = cvs.points(df, 'dropoff_x', 'dropoff_y',  ds.count('passenger_count'))\n",
    "img = tf.shade(agg, cmap=[\"white\", 'darkblue'], how='linear')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The resulting image is similar to the 100,000-point Bokeh plot above, but (a) makes use of all 12 million datapoints, (b) is computed in only a tiny fraction of the time, (c) does not require any magic-number parameters like size and alpha, and (d) automatically ensures that there is no saturation or overplotting:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "img"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This plot renders the count at every pixel as a color from the specified range (here from white to dark blue), mapped linearly. If your display device were linear, and the data were distributed evenly across this color range, then the result of such linear, auto-ranged processing would be an effective, parameter-free way to visualize your dataset.\n",
    "\n",
    "However, real display devices are not typically linear, and more importantly, real data is rarely distributed evenly.  Here, it is clear that there are \"hotspots\" in dropoffs, with a very high count for areas around Penn Station and Madison Square Garden, relatively low counts for the rest of Manhattan's streets, and apparently no dropoffs anywhere else but La Guardia airport.  NYC taxis definitely cover a larger geographic range than this, so what is the problem?  To see, let's look at the histogram of counts for the above image:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def histogram(x,colors=None):\n",
    "    hist,edges = np.histogram(x, bins=100)\n",
    "    p = figure(y_axis_label=\"Pixels\",\n",
    "               tools='', height=130, outline_line_color=None,\n",
    "               min_border=0, min_border_left=0, min_border_right=0,\n",
    "               min_border_top=0, min_border_bottom=0)\n",
    "    p.quad(top=hist[1:], bottom=0, left=edges[1:-1], right=edges[2:])\n",
    "    print(\"min: {}, max: {}\".format(np.min(x),np.max(x)))\n",
    "    show(p)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "histogram(agg.values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clearly, most of the pixels have very low counts (under 3000), while a very few pixels have much larger counts (up to 22000, in this case).  When these values are mapped into colors for display, nearly all of the pixels will end up being colored with the lowest colors in the range, i.e. white or nearly white, while the other colors in the available range will be used for only a few dozen pixels at most.  Thus most of the pixels in this plot convey very little information about the data, wasting nearly all of dynamic range available on your display device.  It's thus very likely that we are missing a lot of the structure in this data that we could be seeing.\n",
    "\n",
    "\n",
    "### 10-million-point datashaded plots: high dynamic range\n",
    "\n",
    "For the typical case of data that is distributed nonlinearly over the available range, we can use nonlinear scaling to map the data range into the visible color range.  E.g. first transforming the values via a log function will help flatten out this histogram and reveal much more of the structure of this data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "histogram(np.log1p(agg.values))\n",
    "\n",
    "tf.shade(agg, cmap=Greys9_r, how='log')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now see that there is rich structure throughout this dataset -- geographic features like streets and buildings are clearly modulating the values in both the high-dropoff regions in Manhattan and the relatively low-dropoff regions in the surrounding areas.  Still, this choice is arbitrary -- why the log function in particular?  It clearly flattened the histogram somewhat, but it was just a guess.  We can instead explicitly equalize the histogram of the data before building the image, making structure visible at every data level (and thus at all the geographic locations covered) in a general way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "histogram(tf.eq_hist(agg.values))\n",
    "    \n",
    "tf.shade(agg, cmap=Greys9_r, how='eq_hist')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The histogram is now fully flat (apart from the spacing of bins caused by the discrete nature of integer counting). Effectively, the visualization now shows a rank-order or percentile distribution of the data.  I.e., pixels are now colored according to where their corresponding counts fall in the distribution of all counts, with one end of the color range for the lowest counts, one end for the highest ones, and every colormap step in between having similar numbers of counts.  Such a visualization preserves the ordering between count values, faithfully displaying local differences in these counts, but discards absolute magnitudes (as the top 1% of the color range will be used for the top 1% of the data values, whatever those may be).\n",
    "\n",
    "Now that the data is visible at every level, we can immediately see that there are some clear problems with the quality of the data -- there is a surprising number of trips that claim to drop off in the water or in the roadless areas of Central park, as well as in the middle of most of the tallest buildings in central Manhattan. These locations are likely to be GPS errors being made visible, perhaps partly because of poor GPS performance in between the tallest buildings.\n",
    "\n",
    "Histogram equalization does not require any magic parameters, and in theory it should convey the maximum information available about the relative values between pixels, by mapping each of the observed ranges of values into visibly discriminable colors.  And it's clearly a good start in practice, because it shows both low values (avoiding undersaturation) and relatively high values clearly, without arbitrary settings.  \n",
    "\n",
    "Even so, the results will depend on the nonlinearities of your visual system, your specific display device, and any automatic compensation or calibration being applied to your display device.  Thus in practice, the resulting range of colors may not map directly into a linearly perceivable range for your particular setup, and so you may want to further adjust the values to more accurately reflect the underlying structure, by adding additional calibration or compensation steps.\n",
    "\n",
    "Moreover, at this point you can now bring in your human-centered goals for the visualization -- once the overall structure has been clearly revealed, you can select specific aspects of the data to highlight or bring out, based on your own questions about the data.  These questions can be expressed at whatever level of the pipeline is most appropriate, as shown in the examples below.  For instance, histogram equalization was done on the counts in the aggregate array, because if we waited until the image had been created, we would have been working with data truncated to the 256 color levels available per channel in most display devices, greatly reducing precision.  Or you may want to focus specifically on the highest peaks (as shown below), which again should be done at the aggregate level so that you can use the full color range of your display device to represent the narrow range of data that you are interested in.  Throughout, the goal is to map from the data of interest into the visible, clearly perceptible range available on your display device.\n",
    "\n",
    "\n",
    "### 10-million-point datashaded plots: interactive\n",
    "\n",
    "Although the above plots reveal the entire dataset at once, the full power of datashading requires an interactive plot, because a big dataset will usually have structure at very many different levels (such as different geographic regions).  Datashading allows auto-ranging and other automatic operations to be recomputed dynamically for the specific selected viewport, automatically revealing local structure that may not be visible from a global view.  Here we'll embed the generated images into a Bokeh plot to support fully interactive zooming.  For the highest detail on large monitors, you should increase the plot width and height above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datashader as ds\n",
    "from datashader.bokeh_ext import InteractiveImage\n",
    "from functools import partial\n",
    "from datashader.utils import export_image\n",
    "from datashader.colors import colormap_select, Greys9, Hot, inferno\n",
    "\n",
    "background = \"black\"\n",
    "export = partial(export_image, export_path=\"export\", background=background)\n",
    "cm = partial(colormap_select, reverse=(background==\"black\"))\n",
    "\n",
    "def create_image(x_range, y_range, w=plot_width, h=plot_height):\n",
    "    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)\n",
    "    agg = cvs.points(df, 'dropoff_x', 'dropoff_y',  ds.count('passenger_count'))\n",
    "    img = tf.shade(agg, cmap=Hot, how='eq_hist')\n",
    "    return tf.dynspread(img, threshold=0.5, max_px=4)\n",
    "\n",
    "p = base_plot(background_fill_color=background)\n",
    "export(create_image(*NYC),\"NYCT_hot\")\n",
    "InteractiveImage(p, create_image)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can now zoom in interactively to this plot, seeing all the points available in that viewport, without ever needing to change the plot parameters for that specific zoom level.  Each time you zoom or pan, a new image is rendered (which takes a few seconds for large datasets), and displayed overlaid any other plot elements, providing full access to all of your data. Here we've used the optional `tf.dynspread` function to automatically enlarge the size of each datapoint once you've zoomed in so far that datapoints no longer have nearby neighbors.\n",
    "\n",
    "### Customizing datashader\n",
    "\n",
    "One of the most important features of datashading is that each of the stages of the datashader pipeline can be modified or replaced, either for personal preferences or to highlight specific aspects of the data.  Here we'll use a high-level `Pipeline` object that encapsulates the typical series of steps in the above `create_image` function, and then we'll customize it.  The default values of this pipeline are the same as the plot above, but here we'll add a special colormap to make the values stand out against an underlying map, and only plot hotspots (defined here as pixels (aggregation bins) that are in the 90th percentile by count): "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from functools import partial\n",
    "\n",
    "def create_image90(x_range, y_range, w=plot_width, h=plot_height):\n",
    "    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)\n",
    "    agg = cvs.points(df, 'dropoff_x', 'dropoff_y',  ds.count('passenger_count'))\n",
    "    img = tf.shade(agg.where(agg>np.percentile(agg,90)), cmap=inferno, how='eq_hist')\n",
    "    return tf.dynspread(img, threshold=0.3, max_px=4)\n",
    "    \n",
    "p = base_plot()\n",
    "p.add_tile(STAMEN_TERRAIN)\n",
    "export(create_image(*NYC),\"NYCT_90th\")\n",
    "InteractiveImage(p, create_image90)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you zoom in to the plot above, you can see that the 90th-percentile criterion at first highlights the most active areas in the entire dataset, and then highlights the most active areas in each subsequent viewport.  Here yellow has been chosen to highlight the strongest peaks, and if you zoom in on one of those peaks you can see the most active areas in that particular geographic region, according to this dynamically evaluated definition of \"most active\". \n",
    "\n",
    "The above plots each followed a roughly standard series of steps useful for many datasets, but you can instead fully customize the computations involved.  This capability lets you do novel operations on the data once it has been aggregated into pixel-shaped bins.  For instance, you might want to plot all the pixels where there were more dropoffs than pickups in blue, and all those where there were more pickups than dropoffs in red.  To do this, just write your own function that will create an image, when given x and y ranges, a resolution (w x h), and any optional arguments needed.  You can then either call the function yourself, or pass it to `InteractiveImage` to make an interactive Bokeh plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def merged_images(x_range, y_range, w=plot_width, h=plot_height, how='log'):\n",
    "    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)\n",
    "    picks = cvs.points(df, 'pickup_x',  'pickup_y',  ds.count('passenger_count'))\n",
    "    drops = cvs.points(df, 'dropoff_x', 'dropoff_y', ds.count('passenger_count'))\n",
    "    drops = drops.rename({'dropoff_x': 'x', 'dropoff_y': 'y'})\n",
    "    picks = picks.rename({'pickup_x': 'x', 'pickup_y': 'y'})\n",
    "    more_drops = tf.shade(drops.where(drops > picks), cmap=[\"darkblue\", 'cornflowerblue'], how=how)\n",
    "    more_picks = tf.shade(picks.where(picks > drops), cmap=[\"darkred\", 'orangered'], how=how)\n",
    "    img = tf.stack(more_picks, more_drops)\n",
    "    return tf.dynspread(img, threshold=0.3, max_px=4)\n",
    "\n",
    "p = base_plot(background_fill_color=background)\n",
    "export(merged_images(*NYC),\"NYCT_pickups_vs_dropoffs\")\n",
    "InteractiveImage(p, merged_images)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can see that pickups are more common on major roads, as you'd expect, and dropoffs are more common on side streets.  In Manhattan, roads running along the island are more common for pickups. If you zoom in to any location, the data will be re-aggregated to the new resolution automatically, again calculating for each newly defined pixel whether pickups or dropoffs were more likely in that pixel. The interactive features of Bokeh are now fully usable with this large dataset, allowing you to uncover new structure at every level. \n",
    "\n",
    "We can also use other columns in the dataset as additional dimensions in the plot.  For instance, if we want to see if certain areas are more likely to have pickups at certain hours (e.g. areas with bars and restaurants might have pickups in the evening, while apartment buildings may have pickups in the morning).  One way to do this is to use the hour of the day as a category, and then colorize each hour:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['hour'] = pd.to_datetime(df['tpep_pickup_datetime']).dt.hour.astype('category')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "colors = [\"#FF0000\",\"#FF3F00\",\"#FF7F00\",\"#FFBF00\",\"#FFFF00\",\"#BFFF00\",\"#7FFF00\",\"#3FFF00\",\n",
    "          \"#00FF00\",\"#00FF3F\",\"#00FF7F\",\"#00FFBF\",\"#00FFFF\",\"#00BFFF\",\"#007FFF\",\"#003FFF\",\n",
    "          \"#0000FF\",\"#3F00FF\",\"#7F00FF\",\"#BF00FF\",\"#FF00FF\",\"#FF00BF\",\"#FF007F\",\"#FF003F\",]\n",
    "\n",
    "def colorized_images(x_range, y_range, w=plot_width, h=plot_height, dataset=\"pickup\"):\n",
    "    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)\n",
    "    agg = cvs.points(df, dataset+'_x', dataset+'_y', ds.count_cat('hour'))\n",
    "    img = tf.shade(agg, color_key=colors)\n",
    "    return tf.dynspread(img, threshold=0.3, max_px=4)\n",
    "\n",
    "p = base_plot(background_fill_color=background)\n",
    "#p.add_tile(STAMEN_TERRAIN)\n",
    "export(colorized_images(*NYC, dataset=\"pickup\"),\"NYCT_pickup_times\")\n",
    "InteractiveImage(p, colorized_images, dataset=\"pickup\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "export(colorized_images(*NYC, dataset=\"dropoff\"),\"NYCT_dropoff_times\")\n",
    "p = base_plot(background_fill_color=background)\n",
    "InteractiveImage(p, colorized_images, dataset=\"dropoff\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here the order of colors is roughly red (midnight), yellow (4am), green (8am), cyan (noon), blue (4pm), purple (8pm), and back to red (since hours and colors are both cyclic). There are clearly hotspots by hour that can now be investigated, and perhaps compared with the underlying map data.  And you can try first filtering the dataframe to only have weekdays or weekends, or only during certain public events, etc., or filtering the resulting pixels to have only those in a certain range of interest.  The system is very flexible, and it should be straightforward to express a very large range of possible queries and visualizations with very little code.\n",
    "\n",
    "The above examples each used pre-existing components provided for the datashader pipeline, but you can implement any components you like and substitute them, allowing you to easily explore and highlight specific aspects of your data. Have fun datashading!"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
